Cross-Validation for SES and MMPC

Description

The functions perform k-fold cross-validation to identify the best values of the SES and MMPC 'max_k' and 'threshold' hyper-parameters.

Usage

cv.ses(target, dataset, kfolds = 10, folds = NULL, alphas = NULL, max_ks = NULL,
task = NULL, metric = NULL, modeler = NULL, ses_test = NULL, ncores = 1)
cv.mmpc(target, dataset, kfolds = 10, folds = NULL, alphas = NULL, max_ks = NULL,
task = NULL, metric = NULL, modeler = NULL, mmpc_test = NULL, ncores = 1)

Arguments

target

The target or class variable as in SES and MMPC.

dataset

The dataset object as in SES and MMPC.

kfolds

The number of folds in the k-fold cross-validation (integer).

folds

The folds of the data to use (a list generated by the function generateCVRuns from the TunePareto package). If NULL, the folds are created internally with the same function.

alphas

A vector of SES or MMPC threshold hyper-parameter values to be used in the cross-validation. Default is c(0.1, 0.05, 0.01).

max_ks

A vector of SES or MMPC max_k hyper-parameter values to be used in the cross-validation. Default is c(3, 2).

task

A character ("C", "R" or "S"). It can be "C" for classification (logistic, multinomial or ordinal regression), "R" for regression (robust and non-robust linear regression, median regression, Poisson and negative binomial regression, beta regression), or "S" for survival regression (Cox or Weibull regression).

metric

A metric function provided by the user. If NULL, the following functions are used: auc.mxm, mse.mxm and ci.mxm for classification, regression and survival analysis tasks, respectively. See Details for more. If you know which metric you want, specify it here to prevent the function from choosing something else.

modeler

A modelling function provided by the user. If NULL, the following functions are used: glm.mxm, lm.mxm and coxph.mxm for classification, regression and survival analysis tasks, respectively. See Details for more. If you know which model you want, specify it here to prevent the function from choosing something else.

ses_test

A function object that defines the conditional independence test used in the SES function (see also the SES help page). If NULL, testIndLogistic, testIndFisher and censIndLR are used for classification, regression and survival analysis tasks, respectively. If you know which test you want, specify it here to prevent the function from choosing something else.

mmpc_test

A function object that defines the conditional independence test used in the MMPC function (see also the SES help page). If NULL, testIndLogistic, testIndFisher and censIndLR are used for classification, regression and survival analysis tasks, respectively.

ncores

The number of cores to use (default is 1). This argument is valid only if you have a multi-threaded machine.

Value

cv_results_all: A list with predictions, performances and signatures for each fold and each SES or MMPC configuration (e.g. cv_results_all[[3]]$performances[1] indicates the performance of the 1st fold with the 3rd configuration of SES or MMPC). In the case of the multi-threaded functions (cvses.par and cvmmpc.par) this is a list with a matrix. The rows correspond to the folds and the columns to the configurations (pairs of threshold and max_k).
best_performance: A numeric value that represents the best average performance.
BC_best_perf: A numeric value that represents the bias corrected best average performance.
best_configuration: A list that corresponds to the best configuration of SES or MMPC including id, threshold (named 'a') and max_k.

Details

Input for metric functions: predictions: A vector of predictions to be tested. test_target: target variable actual values to be compared with the predictions.

The output of a metric function is a single numeric value. Higher values indicate better performance. Metrics based on error measures should be modified accordingly (e.g., by multiplying the error by -1).

The metric functions that are currently supported are:

auc.mxm: "area under the receiver operating characteristic curve" metric, as provided in the package ROCR.
acc.mxm: accuracy metric.
mse.mxm: -1 * (mean squared error), for robust and non robust linear regression and median (quantile) regression.
ci.mxm: 1 - concordance index as provided in the rcorr.cens function from the Hmisc package. This is to be used with the Cox proportional hazards model only.
ciwr.mxm: concordance index as provided in the rcorr.cens function from the Hmisc package. This is to be used with the Weibull regression model only.
poisdev.mxm: Poisson regression deviance.
nbdev.mxm: Negative binomial regression deviance.
ord_mae.mxm: Ordinal regression mean absolute error.

Usage: metric(predictions, test_target)
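
As a minimal sketch of a user-supplied metric, the function below returns the negative mean absolute error, so that higher values indicate better performance as required above. The name neg_mae is hypothetical and not part of the package; only the (predictions, test_target) signature is taken from this page.

```r
# Hypothetical user-defined metric (not part of the package):
# negative mean absolute error, so that higher values are better.
neg_mae <- function(predictions, test_target) {
  -mean(abs(predictions - test_target))
}

# Quick check on toy values: a perfect prediction scores 0,
# anything worse scores below 0.
neg_mae(predictions = c(1, 2, 3), test_target = c(1, 2, 3))
neg_mae(predictions = c(1, 2, 3), test_target = c(1, 2, 5))
```

A function of this shape could then be supplied through the metric argument, e.g. metric = neg_mae.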

Input of modelling functions: train_target: target variable used in the training procedure. sign_data: training set. sign_test: test set.

Modelling functions return a single vector of predictions, obtained by fitting the model on sign_data and train_target and then applying it to sign_test.

The modelling functions that are currently supported are:

glm.mxm: fits a glm for a binomial family (Classification task).
lm.mxm: fits a linear model (stats) for the regression task.
coxph.mxm: fits a Cox proportional hazards regression model for the survival task.
weibreg.mxm: fits a Weibull regression model for the survival task.
rq.mxm: fits a quantile (median) regression model for the regression task.
lmrob.mxm: fits a robust linear model for the regression task.
pois.mxm: fits a Poisson regression model for the regression task.
nb.mxm: fits a negative binomial regression model for the regression task.
multinom.mxm: fits a multinomial regression model for the classification task.
ordinal.mxm: fits an ordinal regression model for the classification task.
beta.mxm: fits a beta regression model for the regression task. The predicted values are mapped onto the real line using the logit transformation, so that the "mse.mxm" metric function can be used. This also allows the performance to be compared with the regression scenario, where the logit is applied and then a regression model is employed.

Usage: modeler(train_target, sign_data, sign_test)
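
The following is a minimal sketch of a user-supplied modelling function with the signature above. It fits ordinary least squares with stats::lm and returns one prediction per row of the test set. The name ols_modeler and the simulated data are hypothetical and serve only as an illustration; the built-in functions may handle their inputs differently.

```r
# Hypothetical user-defined modeler (not part of the package):
# ordinary least squares via stats::lm, returning test-set predictions.
ols_modeler <- function(train_target, sign_data, sign_test) {
  sign_data <- as.data.frame(sign_data)
  sign_test <- as.data.frame(sign_test)
  # make sure the training and test sets share the same column names
  colnames(sign_test) <- colnames(sign_data)
  fit <- lm(train_target ~ ., data = sign_data)
  unname(predict(fit, newdata = sign_test))
}

# Toy data, simulated purely for illustration
set.seed(1)
x <- matrix(rnorm(40 * 2), ncol = 2)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(40, sd = 0.1)

# Train on the first 30 rows, predict the remaining 10
preds <- ols_modeler(y[1:30], x[1:30, ], x[31:40, ])
length(preds)
```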

Note that the Tibshirani and Tibshirani (2009) bias correction method is applied. The procedure will be more automated in the future and more functions will be added. The multithreaded functions have been tested and no error has been detected. However, if you spot any suspicious results please let us know.
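
To illustrate the idea behind the bias correction, here is a sketch of the general Tibshirani and Tibshirani (2009) estimate applied to a folds-by-configurations performance matrix in which higher values are better. The matrix perf is simulated and all variable names are hypothetical; this page does not show how the package computes BC_best_perf internally, so treat this as an outline of the method rather than the package's code.

```r
# Simulated performances, purely for illustration:
# 10 folds (rows) by 6 configurations (columns), higher = better.
set.seed(1)
perf <- matrix(-rexp(10 * 6), nrow = 10, ncol = 6)

avg_perf  <- colMeans(perf)       # average performance per configuration
best_id   <- which.max(avg_perf)  # configuration with the best average
best_perf <- avg_perf[best_id]

# Per fold: gap between that fold's own best configuration and the
# overall best configuration. The mean gap estimates the optimism
# introduced by selecting the best configuration.
bias <- mean(apply(perf, 1, max) - perf[, best_id])

# Bias-corrected best performance (never larger than best_perf)
bc_best_perf <- best_perf - bias
c(best = best_perf, corrected = bc_best_perf)
```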

References

Tibshirani R.J., and Tibshirani R. (2009). A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics 3(2): 822-829.

Examples

set.seed(1234)

# simulate a dataset with continuous data
dataset <- matrix( rnorm(100 * 100), ncol = 100 )
# the target feature is the last column of the dataset as a vector
target <- dataset[, 100]
dataset <- dataset[, -100]

# get 50 percent of the dataset as a train set
train_set <- dataset[1:50, ]
train_target <- target[1:50]

require(hash)
# run a 10 fold CV for the regression task
best_model = cv.ses(target = train_target, dataset = train_set, kfolds = 10, task = "R")

# get the results
best_model$best_configuration
best_model$best_performance

# summary elements of the process. Press tab after each $ to view all the elements and
# choose the one you are interested in.
# best_model$cv_results_all[[...]]$...
# e.g.
# mse value for the 1st configuration of SES on the 5th fold
abs(best_model$cv_results_all[[1]]$performances[5])

best_a <- best_model$best_configuration$a
best_max_k <- best_model$best_configuration$max_k